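Transformers for vision-language representation learning have attracted a lot of interest and have shown strong performance on visual question answering (VQA) and grounding. However, most systems that perform well on these tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. The approach leverages capsules by grouping each visual token in the visual encoder, and uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach to VQA grounding on the challenging GQA dataset as well as the VQA-HAT dataset. Our experiments show that, when the information of masked objects is removed from standard transformer architectures, integrating capsules significantly improves the grounding ability of such systems and yields new state-of-the-art results compared with other approaches in the field.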
translated by Google Translate
As the accuracy of machine learning models increases at a fast rate, so does their demand for energy and compute resources. On a low level, the major part of these resources is consumed by data movement between different memory units. Modern hardware architectures contain a form of fast memory (e.g., cache, registers), which is small, and a slow memory (e.g., DRAM), which is larger but expensive to access. We can only process data that is stored in fast memory, which incurs data movement (input/output-operations, or I/Os) between the two units. In this paper, we provide a rigorous theoretical analysis of the I/Os needed in sparse feedforward neural network (FFNN) inference. We establish bounds that determine the optimal number of I/Os up to a factor of 2 and present a method that uses a number of I/Os within that range. Much of the I/O-complexity is determined by a few high-level properties of the FFNN (number of inputs, outputs, neurons, and connections), but if we want to get closer to the exact lower bound, the instance-specific sparsity patterns need to be considered. Departing from the 2-optimal computation strategy, we show how to reduce the number of I/Os further with simulated annealing. Complementing this result, we provide an algorithm that constructively generates networks with maximum I/O-efficiency for inference. We test the algorithms and empirically verify our theoretical and algorithmic contributions. In our experiments on real hardware we observe speedups of up to 45$\times$ relative to the standard way of performing inference.
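To make the role of data movement concrete, here is a minimal sketch of a toy two-level memory model. It is not the paper's cost model or algorithm: the function `count_ios`, the LRU eviction policy, and the choice to count only loads of neuron values (ignoring weights and write-backs) are all assumptions made for illustration. It shows that the same sparse network can need a different number of loads depending on the order in which its connections are processed.

```python
from collections import OrderedDict


def count_ios(edges, order, capacity):
    """Count loads from slow to fast memory in a toy model (an assumption).

    edges:    list of (source_neuron, target_neuron) connections
    order:    sequence of indices into `edges`, the processing order
    capacity: number of neuron values that fit in fast memory (>= 2)
    Fast memory is modelled as an LRU cache of neuron values.
    """
    if capacity < 2:
        raise ValueError("need room for both endpoints of a connection")
    cache = OrderedDict()  # keys are neurons, ordered from least to most recent
    loads = 0
    for idx in order:
        for neuron in edges[idx]:
            if neuron in cache:
                cache.move_to_end(neuron)  # hit: mark as most recently used
            else:
                loads += 1  # miss: one load from slow memory
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict least recently used
                cache[neuron] = True
    return loads


# Two inputs (a, b) fully connected to two outputs (x, y).
edges = [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y")]

print(count_ios(edges, [0, 1, 2, 3], capacity=2))  # grouped by target: 6
print(count_ios(edges, [0, 2, 3, 1], capacity=2))  # reuses a, then y, then b: 5
```

In this toy setting, reordering the connections alone saves one load, which is the kind of instance-specific effect that a search procedure such as simulated annealing over processing orders could exploit.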
We develop Few-Shot Learning models trained to recognize five or ten different dynamic hand gestures, respectively, which are arbitrarily interchangeable by providing the model with one, two, or five examples per hand gesture. All models were built in the Few-Shot Learning architecture of the Relation Network (RN), in which Long Short-Term Memory cells form the backbone. The models use hand reference points extracted from RGB video sequences of the Jester dataset, which was modified to contain 190 different types of hand gestures. Results show accuracy of up to 88.8% for recognition of five and up to 81.2% for ten dynamic hand gestures. The research also sheds light on the potential effort savings of using a Few-Shot Learning approach instead of a traditional Deep Learning approach to detect dynamic hand gestures. Savings were defined as the number of additional observations required when a Deep Learning model is trained on new hand gestures instead of a Few-Shot Learning model. The difference in the total number of observations required to achieve approximately the same accuracy indicates potential savings of up to 630 observations for five and up to 1260 observations for ten hand gestures to be recognized. Since labeling video recordings of hand gestures requires significant effort, these savings can be considered substantial.
Artificial Intelligence (AI) systems have been increasingly used to make decision-making processes faster, more accurate, and more efficient. However, such systems are also at constant risk of being attacked. While the majority of attacks targeting AI-based applications aim to manipulate classifiers or training data and alter the output of an AI model, recently proposed Sponge Attacks against AI models aim to impede the classifier's execution by consuming substantial resources. In this work, we propose \textit{Dual Denial of Decision (DDoD) attacks against collaborative Human-AI teams}. We discuss how such attacks aim to deplete \textit{both computational and human} resources, and significantly impair decision-making capabilities. We describe DDoD on human and computational resources and present potential risk scenarios in a series of exemplary domains.
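Successfully training end-to-end deep networks for real motion deblurring requires datasets of sharp/blurred image pairs that are realistic and diverse enough to generalize to real blurred images. Obtaining such datasets remains a challenging task. In this paper, we first review the limitations of existing deblurring benchmark datasets from the perspective of generalization to blurry images in the wild. Second, we propose an efficient procedural methodology for generating sharp/blurred image pairs, based on a simple yet effective image formation model. This allows the generation of a virtually unlimited number of realistic and diverse training pairs. We demonstrate the effectiveness of the proposed dataset by training existing deblurring architectures on the simulated pairs and evaluating them on four standard datasets of real blurred images. We observe superior generalization performance on the ultimate task of deblurring real motion-blurred photos of dynamic scenes when training with the proposed method.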
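Despite substantial progress in deep learning approaches to time-series reconstruction, no existing method is designed to uncover local activities with minute signal strength, because their contribution to the optimization loss is negligible. Such local activities, however, can signify important abnormal events in physiological systems, such as an extra focus triggering an abnormal propagation of electrical waves in the heart. We discuss a novel technique for reconstructing such local activity which, although small in signal strength, is the cause of subsequent global activity with larger signal strength. Our central innovation is to approach this problem by explicitly modeling and disentangling how the latent state of a system is influenced by potential hidden internal interventions. In a novel neural formulation of state-space models (SSMs), we first introduce causal-effect modeling of the latent dynamics via a system of interacting neural ODEs that separately describe 1) the continuous-time dynamics of the internal intervention, and 2) its effect on the trajectory of the system's native state. Because the intervention cannot be observed directly and must be disentangled from its observed subsequent effect, we integrate knowledge of the system's native intervention-free dynamics, and infer the hidden intervention by assuming it to be responsible for the differences between the actually observed dynamics and the hypothetical intervention-free dynamics. We demonstrate a proof of concept of the presented framework by reconstructing, from remote observations, ectopic foci that disrupt the course of normal cardiac electrical propagation.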
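The idea of attributing a hidden intervention to the gap between observed and intervention-free dynamics can be illustrated with a deliberately simple sketch. This is not the paper's neural state-space model: it assumes a fully observed scalar state, a known native dynamics function `native_dynamics` (a hypothetical name), an additive intervention, and forward-Euler time stepping.

```python
def native_dynamics(x):
    """Assumed intervention-free dynamics: simple decay, dx/dt = -x."""
    return -x


def simulate(x0, interventions, dt):
    """Forward-Euler rollout of dx/dt = native_dynamics(x) + u(t)."""
    xs = [x0]
    for u in interventions:
        xs.append(xs[-1] + dt * (native_dynamics(xs[-1]) + u))
    return xs


def infer_interventions(xs, dt):
    """Recover u(t) as observed rate of change minus the native prediction."""
    return [
        (x_next - x) / dt - native_dynamics(x)
        for x, x_next in zip(xs, xs[1:])
    ]


# A brief, small intervention at the third time step.
true_u = [0.0, 0.0, 1.0, 0.0, 0.0]
observed = simulate(x0=1.0, interventions=true_u, dt=0.1)
estimated_u = infer_interventions(observed, dt=0.1)
print([round(u, 6) for u in estimated_u])
```

In this idealized setting the intervention is recovered up to floating-point error. With latent states, noisy remote observations, and learned dynamics, the same residual has to be inferred rather than computed directly.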
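The removal or cancellation of noise has widespread applications in imaging and acoustics. In everyday applications, denoising may even include generative aspects that are unfaithful to the ground truth. For scientific applications, however, denoising must reproduce the ground truth accurately. Here, we show how data can be denoised with a deep convolutional neural network such that weak signals appear with quantitative accuracy. In particular, we study X-ray diffraction on crystalline materials. We demonstrate that weak signals stemming from charge ordering, which are insignificant in the noisy data, become visible and accurate in the denoised data. This success is enabled by supervised training of a deep neural network with pairs of low- and high-noise data. In this way, the neural network learns the statistical properties of the noise. We demonstrate that using artificial noise (such as Poisson and Gaussian noise) does not yield such quantitatively accurate results. Our approach thus illustrates a practical noise-filtering strategy that can be applied to challenging acquisition problems.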
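Non-uniform image deblurring is a challenging task because the blurry image itself lacks temporal and textural information. Complementary information from auxiliary sensors, such as event sensors, is being explored to address these limitations. Event sensors asynchronously record changes in logarithmic intensity, called events, with high temporal resolution and high dynamic range. Current event-based deblurring methods combine the blurry image with events to jointly estimate per-pixel motion and the deblur operator. In this paper, we argue that a divide-and-conquer approach is more suitable for this task. To this end, we propose to use modulated deformable convolutions, whose kernel offsets and modulation masks are dynamically estimated from events to encode the motion in the scene, while the deblur operator is learned from the combination of the blurry image and the corresponding events. Furthermore, we employ a coarse-to-fine multi-scale reconstruction approach to cope with the inherent sparsity of events in low-contrast regions. Importantly, we introduce the first dataset containing pairs of real RGB blurry images and the related events recorded during the exposure time. Our results show better overall robustness when using events, with PSNR improvements of up to 1.57 dB on synthetic data and 1.08 dB on real event data.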
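For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). The function name `psnr`, the flat-list image representation, and the default peak value of 255 (8-bit images) are assumptions for illustration, not details taken from the paper's evaluation code.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


sharp = [10, 20, 30, 40]
restored = [11, 21, 31, 41]  # every pixel off by one, so MSE = 1
print(round(psnr(sharp, restored), 2))  # 48.13
```

Because PSNR is logarithmic in the mean squared error, a gain of 1.57 dB at a fixed peak value corresponds to dividing the MSE by 10^0.157, roughly 1.44.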
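Decision-making is critical for lane changes in autonomous driving. Reinforcement learning (RL) algorithms aim to determine the value of behaviors in various situations, which makes them a promising route to solving the decision-making problem. However, poor runtime safety prevents RL-based decision-making strategies from being applied to complex driving tasks in practice. To address this problem, this paper incorporates human demonstrations into an RL-based decision-making strategy. Decisions made by human subjects in a driving simulator are treated as safe demonstrations, which are stored in the replay buffer and then used to enhance the RL training process. A complex lane-change task is set up to examine the performance of the developed strategy. Simulation results show that human demonstrations can effectively improve the safety of RL decisions, and that the proposed strategy surpasses other learning-based decision-making strategies with respect to multiple driving performance measures.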
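One common way to combine demonstrations with agent experience is sketched below. The abstract only states that demonstrations are stored in the replay buffer, so the class `DemoReplayBuffer`, the separate never-evicted demonstration pool, and the fixed `demo_fraction` per batch are all assumptions made for illustration rather than the paper's design.

```python
import random
from collections import deque


class DemoReplayBuffer:
    """Replay buffer mixing human demonstrations with agent transitions."""

    def __init__(self, capacity, demo_fraction=0.25, seed=None):
        self.demos = []                      # human demonstrations, never evicted
        self.agent = deque(maxlen=capacity)  # agent experience, oldest dropped first
        self.demo_fraction = demo_fraction   # assumed share of each batch
        self.rng = random.Random(seed)

    def add_demo(self, transition):
        self.demos.append(transition)

    def add_agent(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size):
        """Draw a shuffled batch with up to `demo_fraction` demonstrations."""
        n_demo = min(len(self.demos), round(batch_size * self.demo_fraction))
        n_agent = min(len(self.agent), batch_size - n_demo)
        batch = self.rng.sample(self.demos, n_demo)
        batch += self.rng.sample(list(self.agent), n_agent)
        self.rng.shuffle(batch)
        return batch


# Transitions are tagged tuples here: (source, state, action, reward).
buffer = DemoReplayBuffer(capacity=100, demo_fraction=0.25, seed=0)
for i in range(10):
    buffer.add_demo(("human", i, "keep_lane", 1.0))
for i in range(200):
    buffer.add_agent(("agent", i, "change_left", 0.0))

batch = buffer.sample(8)
print(sum(1 for t in batch if t[0] == "human"), "of", len(batch), "from demos")
```

Keeping demonstrations in their own pool means they are not overwritten as the agent generates new experience; a single shared buffer with prioritized sampling would be another reasonable choice.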
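One of the ultimate goals of Artificial Intelligence is to learn general and human-interpretable knowledge from raw data. Neuro-symbolic reasoning approaches partly address this problem by using a manually engineered symbolic knowledge base to improve the training of a neural network. In cases where the symbolic knowledge is learned from raw data, that knowledge lacks the expressivity required to solve complex problems. In this paper, we introduce the Neuro-Symbolic Inductive Learner (NSIL), an approach that trains a neural network to extract latent concepts from raw data while learning symbolic knowledge, defined in terms of these latent concepts, that solves complex problems. The novelty of our approach is a method for biasing the symbolic learner toward learning improved knowledge, based on the training performance of both the neural and the symbolic components. We evaluate NSIL on two problem domains that require learning knowledge of different levels of complexity, and demonstrate that NSIL learns knowledge that cannot be learned with other neuro-symbolic systems, while outperforming baseline models in terms of accuracy and data efficiency.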